Lexicon-based Document Representation

نویسندگان

  • Gloria Virginia
  • Hung Son Nguyen
چکیده

It is a big challenge for an information retrieval system (IRS) to interpret the queries made by users, particularly because the common form of query consists of very few terms. Tolerance rough sets models (TRSM), as an extension of rough sets theory, have demonstrated their ability to enrich document representation in terms of semantic relatedness. However, system efficiency is at stake because the weight vector created by TRSM (TRSM-representation) is much less sparse. We mapped the terms occurring in TRSM-representation to terms in the lexicon, hence the final representation of a document was a weight vector consisting only of terms that occurred in the lexicon (LEX-representation). The LEX-representation can be viewed as a compact form of TRSM-representation in a lower dimensional space and eliminates all informal terms previously occurring in TRSM-vector. With these facts, we may expect a more efficient system. We employed recall and precision commonly used in information retrieval to evaluate the effectiveness of LEXrepresentation. Based on our examination, we found that the effectiveness of LEX-representation is comparable with TRSM-representation while the efficiency of LEX-representation should be better than the existing TRSM-representation. We concluded that lexicon-based document representation was another alternative potentially used to represent a document while considering semantics. We are tempted to implement the LEX-representation together with linguistic computation, such as tagging and feature selection, in order to retrieve more relevant terms with high weight. With regard to the TRSM method, enhancing the quality of tolerance class is crucial based on the fact that the ∗Address for correspondence: Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland 28 G. Virginia and H. S. Nguyen / Lexicon-based Document Representation TRSM method is fully reliant on the tolerance classes. We plan to combine other resources such as Wikipedia Indonesia to generate a better tolerance class.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Meaning extraction based on frame representation for Japanese spoken dialogue

This paper describes the issue of meaning extraction based on frame representation for spoken dialogue. The framework of semantic lexicon and the structure of semantic caseframe which focuses on the keywords in phrase is described. The results of the preliminary experiments of meaning extraction using spontaneous speech in the scheduling task and the document retrieval task are shown.

متن کامل

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

متن کامل

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

Mental Representation of Cognates/Noncognates in Persian-Speaking EFL Learners

The purpose of this study was to investigate the mental representation of cognate and noncognate translation pairs in languages with different scripts to test the prediction of dual lexicon model (Gollan, Forster, & Frost, 1997). Two groups of Persian-speaking English language learners were tested on cognate and noncognate translation pairs in Persian-English and English-Persian directions with...

متن کامل

Document relevance calculation based on Lexical cohesion with structure analysis

This paper explores the feasibility of constructing a document relevance calculating model based on lexical cohesion with structure analysis. In this model, by extracting the semanticrelative word clusters in documents according to the lexicon cohesion principle, documents are formalized in expressions which are composed of lexicon chains with structure information. And based on this kind of re...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Fundam. Inform.

دوره 124  شماره 

صفحات  -

تاریخ انتشار 2013